# Lab 21 - k-means clustering

We will use the labor market data set from Lab 20.  It is available [here](http://comet.lehman.cuny.edu/owen/teaching/mat328/Nov2019_labor_market_majors.csv).

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.cluster.hierarchy as shc

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans

from sklearn.metrics import confusion_matrix

from sklearn import datasets

%matplotlib inline
pd.set_option("display.max_columns", None)

### Clustering labor market data using k-means

Let's load the labor market data into a dataframe called `labor`.  Remember to skip the first 13 rows and the last 3 rows, and to set the `Major` column as the index.
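
If the `skiprows` / `skipfooter` options are unfamiliar, here is a minimal sketch on a made-up, in-memory file (13 lines of notes, a header, two invented rows, and 3 footnote lines). The majors and wage values are placeholders, not the real data. For the lab, pass the URL above to `pd.read_csv` instead of `toy_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file: 13 lines of notes,
# then the header and data, then 3 lines of footnotes.
preamble = "".join(f"note {i}\n" for i in range(13))
body = (
    "Major,Median Wage Early Career,Median Wage Mid-Career\n"
    'Major A,"52,000","75,000"\n'
    'Major B,"50,000","70,000"\n'
)
footer = "footnote 1\nfootnote 2\nfootnote 3"
toy_csv = io.StringIO(preamble + body + footer)

toy = pd.read_csv(
    toy_csv,
    skiprows=13,        # ignore the notes at the top
    skipfooter=3,       # ignore the footnotes at the bottom
    engine="python",    # skipfooter is only supported by the python engine
    index_col="Major",  # use the Major column as the index
)
print(toy)
```

Note that the wage values come through as text because of the commas inside them.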

Check your dataframe was created correctly.

Which two columns are not numerical types (integers or floats)?  Can you guess why?

Remove the commas in the `Median Wage Early Career` and `Median Wage Mid-Career` columns, and convert them to type float.
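
One possible way to do this, sketched on a small made-up dataframe (the values are invented), is to strip the commas with the string accessor and then cast the column:

```python
import pandas as pd

# Made-up example with wages stored as text, e.g. "52,000"
toy = pd.DataFrame(
    {
        "Median Wage Early Career": ["52,000", "50,000"],
        "Median Wage Mid-Career": ["75,000", "70,000"],
    },
    index=pd.Index(["Major A", "Major B"], name="Major"),
)

for col in ["Median Wage Early Career", "Median Wage Mid-Career"]:
    # remove the thousands separator, then convert the text to floats
    toy[col] = toy[col].str.replace(",", "", regex=False).astype(float)

print(toy.dtypes)
```

The same loop should work on `labor`, assuming the two wage columns were read in as text.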

Check the conversion happened correctly.

As the k-means clustering algorithm also uses the distance between data points, we need to scale the data in each column to be between 0 and 1.
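
As a reminder of what min-max scaling does, here is a small sketch on invented numbers (the column names `rate` and `wage` are placeholders). Each column is rescaled separately so that its smallest value becomes 0 and its largest becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two columns on very different scales
toy = pd.DataFrame(
    {"rate": [2.0, 4.0, 6.0], "wage": [40000.0, 50000.0, 80000.0]},
    index=["Major A", "Major B", "Major C"],
)

scaler = MinMaxScaler()
# fit_transform returns a NumPy array, so the row and column labels are lost
toy_scaled = scaler.fit_transform(toy)
print(toy_scaled)
```

Without scaling, the wage column would dominate every distance calculation simply because its numbers are much larger.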

Put the scaled data back into a dataframe, using the column and index names from the original data set.

In [None]:
labor_scaled = pd.DataFrame(labor_scaled, columns = labor.columns, index = labor.index)
labor_scaled.head()

<details> <summary>Answer:</summary>
<code>labor_scaled = pd.DataFrame(labor_scaled, columns = labor.columns, index = labor.index)
</code>
</details>

Let's run k-means clustering on the data.  As with other scikit-learn algorithms, we first create a KMeans variable (object) with the number of clusters, and then apply it to our data.

In [None]:
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans_clusters = kmeans.fit_predict(labor_scaled)
kmeans_clusters
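
To see what `fit_predict` returns without the full data set, here is a sketch on four made-up points that form two obvious groups. The `n_init=10` argument is set explicitly only so the sketch behaves the same across scikit-learn versions; it is not part of the lab's code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented points: two near (0, 0) and two near (1, 1)
points = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]])

toy_kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
toy_labels = toy_kmeans.fit_predict(points)  # one cluster number per row

print(toy_labels)
print(toy_kmeans.cluster_centers_)  # one center per cluster
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 carries no meaning. In the lab itself, the labels we care about are the ones in `kmeans_clusters` from the cell above.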

Store these cluster assignments in the `labor` dataframe in a column called `kmeans_clusters`.

<details> <summary>Answer:</summary>
<code>labor["kmeans_clusters"] = kmeans_clusters
</code>
</details>

Now use the hierarchical clustering from Lab 20 to also cluster the data into 4 groups.

Store these cluster assignments in the `labor` dataframe in a column called `tree_clusters`.
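
If you need a refresher, the sketch below runs agglomerative clustering on eight made-up points. The `linkage="ward"` setting is an assumption; use whatever settings you used in Lab 20.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Eight invented points forming four tight pairs
points = np.array([
    [0.0, 0.0], [0.1, 0.0],
    [1.0, 1.0], [0.9, 1.0],
    [0.0, 1.0], [0.1, 1.0],
    [1.0, 0.0], [0.9, 0.1],
])

# linkage="ward" is an assumed setting, not necessarily the one from Lab 20
tree = AgglomerativeClustering(n_clusters=4, linkage="ward")
tree_labels = tree.fit_predict(points)
print(tree_labels)
```

For the lab, fit on `labor_scaled` rather than on these toy points.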

Display the dataframe `labor` and check that the two new columns were added correctly.

Compare the clusters by using a filter to display only the rows in the first k-means cluster.  How does this cluster differ from the hierarchical one?
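
A boolean filter is one way to do this. The sketch below uses a tiny made-up dataframe with invented cluster numbers; on the real data you would filter `labor` instead.

```python
import pandas as pd

# Made-up cluster assignments for four placeholder majors
toy = pd.DataFrame(
    {"kmeans_clusters": [0, 1, 0, 2], "tree_clusters": [3, 1, 3, 3]},
    index=pd.Index(["Major A", "Major B", "Major C", "Major D"], name="Major"),
)

# keep only the rows assigned to k-means cluster 0
first_cluster = toy[toy["kmeans_clusters"] == 0]
print(first_cluster)

# count which hierarchical clusters those rows fall into
print(first_cluster["tree_clusters"].value_counts())
```

If every row in the filtered dataframe has the same `tree_clusters` value, the two methods agree on that group, even if they number it differently.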

Now do the same thing for each of the other 3 k-means clusters.

Compute the confusion matrix for the k-means clusters and the tree clusters.

<details> <summary>Answer:</summary>
<code>confusion_matrix(labor["kmeans_clusters"],labor["tree_clusters"])
</code>
</details>
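
When reading this matrix, keep in mind that cluster numbers are arbitrary, so two methods can agree perfectly without the large counts landing on the diagonal. A sketch with made-up labels, where the two labelings describe the same groups under different numbers:

```python
from sklearn.metrics import confusion_matrix

# Invented labels: same three groups, but groups 0 and 1 are numbered differently
toy_kmeans_labels = [0, 0, 1, 1, 2, 2]
toy_tree_labels = [1, 1, 0, 0, 2, 2]

# rows are the first argument's labels, columns are the second argument's
cm = confusion_matrix(toy_kmeans_labels, toy_tree_labels)
print(cm)
```

Each row has a single non-zero entry, which means every k-means cluster corresponds to exactly one tree cluster. Rows with several non-zero entries show where the two methods split the data differently.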

### Clustering digit images

This section roughly follows the example [here](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html).

Scikit-learn contains images of hand-written digits.  Each digit has been encoded as an 8 pixel x 8 pixel image, and the 8x8 = 64 pixels are each stored as a number representing the darkness of that pixel.

See [here](https://scipy-lectures.org/packages/scikit-learn/auto_examples/plot_digits_simple_classif.html) for example images of the digits.

First we load the digits.  As with the other scikit-learn datasets, it's stored as a dictionary-like object.

In [None]:
digits = datasets.load_digits()

Display the possible keys.

We will just use the data directly as `digits.data` instead of making a dataframe from it (notice there is no list of features or column names here).
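
To check how the 64 numbers per row relate to the 8x8 images, here is a short sketch. It reloads the data set under a different name (`toy_digits`) only so that it runs on its own.

```python
from sklearn import datasets

toy_digits = datasets.load_digits()

# one row per image, one column per pixel
print(toy_digits.data.shape)
# the same pixels, kept as 8x8 grids
print(toy_digits.images.shape)

# reshaping a row of 64 values gives back the 8x8 image
first_image = toy_digits.data[0].reshape(8, 8)
print((first_image == toy_digits.images[0]).all())
```

This is the same reshaping trick used later to plot the cluster centers as images.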

Run the k-means clustering algorithm with 10 clusters (for digits 0-9) on this data.

<details> <summary>Answer:</summary>
<code>kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
</code>
</details>

We can get the centers of the clusters using `kmeans.cluster_centers_`.  Try it below.

In [None]:
kmeans.cluster_centers_

These centers are not easy to interpret as numbers, but we can plot them with the following code.

In [None]:
# Create 10 sub-plots
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
# Reshape each of the 10 centers (64 values) into an 8x8 image
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
# For each of the 10 centers, using one of the 10 subplots:
for axi, center in zip(ax.flat, centers):
    # hide the x and y tick marks on that subplot
    axi.set(xticks=[], yticks=[])
    # plot the image
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)